Technical Report PTR-1092-80-8
Contract  No.  N00014-80-C-0150 
Work  Unit  No.  NR  197-064 
August  1980 


HOW  WELL  DO  PROBABILITY  EXPERTS 
ASSESS  PROBABILITIES? 

SARAH  LICHTENSTEIN 
BARUCH  FISCHHOFF 


DECISION  RESEARCH 
A  BRANCH  OF  PERCEPTRONICS 


Prepared  For: 

OFFICE  OF  NAVAL  RESEARCH 
Department  of  the  Navy 
800  North  Quincy  Street 
Arlington,  Virginia  22217 


PERCEPTRONICS 


6271 VARIEL AVENUE • WOODLAND HILLS • CALIFORNIA 91367 • PHONE (213) 884-7470


REPORT DOCUMENTATION PAGE

Report Number: PTR-1092-80-8
Title: How well do probability experts assess probabilities?
Authors: Sarah Lichtenstein and Baruch Fischhoff
Performing Organization Name and Address: Decision Research, A Branch of Perceptronics, 1201 Oak Street, Eugene, Oregon 97401
Controlling Office Name and Address: Office of Naval Research, Arlington, Virginia 22217
Program Element, Project, Task Area and Work Unit Numbers: Work Unit NR 197-064
Number of Pages: 22
Security Class. (of this report): Unclassified
Distribution Statement (of this report): Approved for public release; distribution unlimited
Key Words: Calibration; probability assessment

ABSTRACT

Past research on people's ability to assess probabilities has shown two common
errors: overconfidence in one's knowledge and insensitivity to task difficulty.
This research has created a new class of experts: those who have studied
probability assessors and who are aware of the common errors. The performance
of eight such experts is here compared to the performance of twelve untrained
subjects and fifteen who had previously received training in probability
assessment. All subjects responded to 500 general-knowledge items whose
difficulty could be measured a priori from the item context. The experts
appeared to have overcorrected for the overconfidence error: they were notably
underconfident, whereas the untrained subjects were overconfident and the
trained subjects were mixed. The experts were more sensitive than the other two
groups to variations in item difficulty. However, even they showed a
substantial insensitivity to difficulty, relative to ideal performance.
Introspection suggests that this second error would be hard to overcome.

Subjective assessment of probabilities has in recent years
increasingly been recognized as an integral part of decision making,
both personal (cf. Jungermann, 1980) and public (e.g., the "Rasmussen
Report," U.S. Nuclear Regulatory Commission, 1975). This recognition
has led to a burgeoning research literature on people's abilities to
make such assessments (see Lichtenstein, Fischhoff & Phillips, 1977).
Typically, participants in this research are presented with a series of
two-alternative, forced-choice questions. For each question, the
assessor first chooses the correct alternative and then assesses the
probability that the chosen alternative is in fact correct. Analyses
of these data have focused on calibration. An assessor is well
calibrated if, over the long run, for all alternatives assigned a given
probability, the proportion of true alternatives is equal to the
probability assigned. Thus, for example, just 70% of all alternatives
assigned a probability of .7 should be correct. When the assessed
probabilities are larger than the proportions correct (e.g., 90% confident
but only 75% correct), the assessors are called "overconfident." The
reverse situation is called "underconfidence."
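
The definition of calibration above can be illustrated with a minimal sketch. The function names (`calibration_curve`, `describe`) and the response data are hypothetical and are not taken from the study; the sketch only tallies, for each stated probability, the proportion of chosen alternatives that were correct, and compares that proportion with the stated probability.

```python
import math
from collections import defaultdict


def calibration_curve(responses):
    """Return {stated probability: (proportion correct, number of responses)}.

    `responses` is an iterable of (probability, was_correct) pairs.
    """
    tally = defaultdict(lambda: [0, 0])  # probability -> [n correct, n total]
    for prob, was_correct in responses:
        tally[prob][0] += int(was_correct)
        tally[prob][1] += 1
    return {p: (hits / n, n) for p, (hits, n) in sorted(tally.items())}


def describe(prob, proportion_correct, tol=1e-9):
    """Label one point of the curve relative to perfect calibration."""
    if math.isclose(prob, proportion_correct, abs_tol=tol):
        return "calibrated"
    return "overconfident" if prob > proportion_correct else "underconfident"


# Hypothetical responses: (assessed probability, chosen alternative was correct).
responses = (
    [(0.7, True)] * 7 + [(0.7, False)] * 3    # 70% correct at .7
    + [(0.9, True)] * 3 + [(0.9, False)] * 1  # 75% correct at .9
)

curve = calibration_curve(responses)
for prob, (prop, n) in curve.items():
    print(f"p={prob:.2f}  n={n}  proportion correct={prop:.2f}  "
          f"{describe(prob, prop)}")
```

In this made-up data the .7 responses are well calibrated, while the .9 responses reproduce the "90% confident but only 75% correct" pattern that the text calls overconfidence.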

Two  robust  findings  have  emerged  from  this  research.  First,  people 
are usually overconfident; they believe they know more than they actually
know. Such overconfidence has been demonstrated in a variety of tasks
(Fischhoff & Slovic, 1980), response modes (Fischhoff, Slovic &
Lichtenstein, 1977), and subject populations (Wright, Phillips, Whalley,
Choo, Ng, Tan & Wisuda, 1978; Cambridge & Shreckengost, Note 1).

The  second  general  finding  is  that  the  degree  of  overconfidence  is 
related  to  the  overall  difficulty  of  the  task.  People  are  most 
overconfident with the hardest tasks (Clarke, 1960; Nickerson & McGoldrick,
1965; Pitz, 1974). As task difficulty decreases, so does overconfidence,
until, with quite easy tasks, people are underconfident (Lichtenstein &
Fischhoff, 1977). Apparently, people are insufficiently sensitive to
task difficulty, and fail to shift the distribution of their probabilistic
responses as much as they should as task difficulty changes.

Recent attempts to reduce overconfidence by training (Lichtenstein &
Fischhoff, 1980) or improved task design (Koriat, Lichtenstein &
Fischhoff, 1980) have been moderately successful. However, no one has
managed to enhance sensitivity to task difficulty.

All this research has produced a new kind of expertise: people who
have studied probability assessors and who are aware of common errors.
The present study explores such experts' ability to use their knowledge
to overcome the errors exhibited by naive assessors. Their performance
is here compared with that of naive subjects and of subjects who had
previously been trained to be well calibrated.

Method 

Subjects 

Experts. The eight expert subjects, five males and three females,
included the present authors, two of their research assistants, and four
other psychologists who have done research in probability assessment.¹
All reported having read the research literature on calibration and
overconfidence.

Trained subjects. In a previous paper (Lichtenstein & Fischhoff,
1980), we reported the results of two studies in which we trained 24
subjects to be well calibrated, using individualized feedback about
calibration after each of 3 or 11 sessions of 200-item, general-knowledge
tests. Sixteen (8 females and 8 males) of those 24 subjects agreed to
serve in the present experiment. Of these, six had received 11 sessions
of previous training, during which they had responded to the 500 items
used in the present study, randomly intermixed with 2,500 items covering
other topics. However, since subjects had not been told the correct
answer to any item during their training, and since a year had elapsed
between the end of training and the present experiment, we felt that
having previously seen the test items would not significantly affect
their present performance. The other ten trained subjects had received
only three sessions of training, and thus had been exposed to only a
small fraction of the items used here. Three to six months had elapsed
between their training and the present experiment.

Untrained subjects. The untrained subjects were 13 people (9 males
and  4  females)  who  responded  to  a  job  listing  at  the  University  of 
Oregon  branch  of  the  State  Employment  Division. 

Stimuli 

The  500  items  were  of  three  types.  The  first  289  items  listed  pairs 
of  continents,  countries,  states,  or  cities;  the  task  was  to  indicate 
which  was  more  populous  (e.g.,  [a]  Las  Vegas,  [b]  Miami;  [a]  Helsinki, 
Finland,  [b]  Milan,  Italy).  The  next  111  items  listed  a  base  city 
followed  by  two  other  cities;  the  task  was  to  indicate  which  of  the  two 
alternative  cities  was  farther  in  distance  from  the  base  city  (e.g., 
Melbourne:  [a]  Rome,  [b]  Tokyo).  The  final  100  items  listed  two 
historical events; the task was to indicate which event happened first
(e.g.,  [a]  Magna  Carta  signed,  [b]  Mohammed  born). 

The items were selected by our secretaries from almanacs, under
general (and vague) instructions not to make the test too hard or too
easy and to avoid deceptive items (i.e., those that might be answered
incorrectly by most people).

Difficulty. In previous research on the relationship between task
difficulty and calibration, difficulty was defined either intuitively
(based on subjects' presumed familiarity with the topics; Pitz, 1974),
or on the basis of subjects' performance in the experiment (i.e., items
for which more subjects chose the correct alternative were taken as
easier; Clarke, 1960; Lichtenstein & Fischhoff, 1977). The latter
strategy leads to an artifactual inflation of the difference in
calibration between easy and hard tests that is difficult to separate
from valid effects.

To  avoid  this  artifact,  the  present  experiment  was  designed  to 
define  item  difficulty  a  priori.  Each  item  involved  two  numbers:  two 
populations,  two  distances,  or  two  years.  We  assumed  that  the  more 
similar  these  two  numbers,  the  harder  the  item  is  likely  to  be.  To  get 
a  measure  of  difficulty,  we  formed  the  ratio  of  the  larger  number  to 
the  smaller  number  (for  the  historical  events,  the  ratios  were  formed 
from  the  number  of  years  elapsed  since  the  events  occurred).  The  250 
items  with  the  largest  ratios  were  designated  as  easy;  the  rest  were 
called  hard.  The  ratios  varied  from  1.01  to  78.79;  the  median,  at 
which  the  hard/easy  division  was  made,  was  1.84. 
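
The a priori difficulty measure described above can be sketched as follows. This is an illustration only: the function names, the reference year used to convert event dates into years elapsed, and the item values are all hypothetical and not taken from the report.

```python
def difficulty_ratio(x, y):
    """Ratio of the larger to the smaller of two positive numbers (always >= 1)."""
    return max(x, y) / min(x, y)


def years_elapsed(event_year, reference_year=1980):
    """Years since an event; the reference year is an assumption of this sketch."""
    return reference_year - event_year


def split_easy_hard(ratios):
    """Label the half of the items with the largest ratios 'easy', the rest 'hard'."""
    order = sorted(range(len(ratios)), key=lambda i: ratios[i], reverse=True)
    labels = ["hard"] * len(ratios)
    for i in order[: len(ratios) // 2]:
        labels[i] = "easy"
    return labels


# Hypothetical items: two populations, two distances, and two event years.
ratios = [
    difficulty_ratio(110_000, 100_000),                          # populations
    difficulty_ratio(9_000, 3_000),                              # distances
    difficulty_ratio(years_elapsed(1500), years_elapsed(1800)),  # years elapsed
    difficulty_ratio(5_000_000, 500_000),                        # populations
]
labels = split_easy_hard(ratios)
for ratio, label in zip(ratios, labels):
    print(f"ratio={ratio:.2f}  {label}")
```

A ratio near 1 means the two numbers are nearly equal, which is the case the text assumes to be hardest; splitting at the median ratio gives equal numbers of easy and hard items.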

Instructions.  The  instructions  were  brief: 

For each question select the answer you believe to be correct
(your best guess). Then assess the probability that your answer
is, in fact, correct. This probability can be any number from
.50 to 1.0. It can be interpreted as your degree of certainty
about the correctness of your answer. For example, if you respond
that the probability is .6, it means that you believe that
there are about 6 chances out of 10 that your answer is correct.
A response of 1.0 means that you are absolutely certain that
your answer is correct. A response of .5 means that your
answer is as likely to be right as wrong. Since there are
two possible answers, if you make a pure guess, you would
probably be right about 1/2 of the time -- thus, .50 would be an
appropriate probability for such guesses. Write your probability
in the space provided on the answer sheet.

To  repeat,  your  probability  is  a  measure  of  your  degree  of 
certainty  that  your  chosen  alternative  is  the  correct 
alternative.  It  is  a  number  from  .50  to  1.0,  where  .50  means 
complete  uncertainty  and  1.0  means  complete  certainty. 

In  addition,  the  expert  and  trained  subjects  (all  of  whom  knew  about 
calibration)  were  asked  to  be  as  well  calibrated  as  possible. 

All  subjects  were  run  individually.  The  trained  and  expert 
subjects who did not live in Eugene were contacted by mail. The trained
and  untrained  subjects  were  paid  for  their  participation;  the  experts 
were  not. 

Results 

Two  subjects,  one  untrained  and  one  trained,  apparently  misunderstood 
the  instructions  for  the  middle  section  of  the  test;  instead  of  picking 
the  city  farthest  from  the  base  city,  they  seemed  to  have  picked  the 
closest  city.  For  those  111  items,  the  untrained  subject  selected  the 
correct alternative only 36% of the time and the trained subject only
8% of the time. Furthermore, they erred most often on the easiest items.
These subjects were dropped from the study. The results that follow
are thus based on 8 expert, 15 trained, and 12 untrained subjects.

Performance measures

Table 1 shows the mean and range for several measures of subjects'
performance. The experts were most knowledgeable, correctly identifying
an average of 75% of the right answers, but they were not the most
confident.


Overconfidence. A measure of overall overconfidence is the signed
difference between the mean assessed probability and the proportion of
correct alternatives chosen. A positive sign indicates overconfidence;
a negative sign, underconfidence. By this measure, the untrained group
was predominantly overconfident; only one of the 12 subjects was
underconfident. Similar overconfidence has been found in previous
studies with untrained subjects (Lichtenstein et al., 1977). In
contrast, the experts were all underconfident. The trained group was
more varied: five were overconfident and ten were underconfident.
Figure 1 shows the calibration for the three groups; the overconfidence
of the untrained group (represented by a curve falling below the
diagonal) and the underconfidence of the experts are readily apparent.


Calibration. A measure of calibration proposed by Murphy (1972, 1973)
and used in our previous work (Lichtenstein & Fischhoff, 1977; in press)
is the mean squared difference between the assessed probabilities and the
corresponding proportions of correct answers, weighted by the number of
responses at each point. It measures the mean squared vertical distance,
in a plot like Figure 1, between the points and the diagonal.² The
calibration scores appear in the last row of Table 1. The expert and
untrained groups were indistinguishable (t = .61; p > .5), whereas
the trained group was significantly better than both the others (t = 2.44;
p = .02).
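
The two measures just described, overall overconfidence and the calibration score, can be sketched as below. This is a hedged illustration rather than the authors' computation: the function names and data are hypothetical, responses are grouped into the categories .5-.59, ..., .9-.99, and 1.0, and using the mean response within each category as the "assessed probability" of that category is an assumption of this sketch.

```python
def category(prob):
    """Map a response in [.5, 1.0] to its group: .5-.59 -> 5, ..., .9-.99 -> 9, 1.0 -> 10."""
    return min(int(round(prob * 100)) // 10, 10)


def overconfidence(responses):
    """Mean assessed probability minus proportion correct (positive = overconfident)."""
    n = len(responses)
    mean_prob = sum(p for p, _ in responses) / n
    prop_correct = sum(1 for _, correct in responses if correct) / n
    return mean_prob - prop_correct


def calibration_score(responses):
    """Weighted mean squared gap between mean response and proportion correct per group."""
    groups = {}  # category -> [sum of probabilities, n correct, n responses]
    for prob, correct in responses:
        g = groups.setdefault(category(prob), [0.0, 0, 0])
        g[0] += prob
        g[1] += int(correct)
        g[2] += 1
    total = len(responses)
    return sum(n * (s / n - k / n) ** 2 for s, k, n in groups.values()) / total


# Hypothetical assessor: 75% correct at .9 and also 75% correct at .6.
responses = (
    [(0.9, True)] * 3 + [(0.9, False)]
    + [(0.6, True)] * 3 + [(0.6, False)]
)
print(f"overconfidence    = {overconfidence(responses):+.4f}")
print(f"calibration score = {calibration_score(responses):.4f}")
```

In this made-up example the overconfidence at .9 and the underconfidence at .6 cancel, so the overall overconfidence is zero while the calibration score is not, which shows why the two measures are reported separately.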


Table 1

Means (and Ranges) of Performance Measures for All 500 Items

                                        Group
Measure               Untrained                 Trained

Proportion Correct    .684                      .667
                      (.630 to .837)            (.574 to .786)

Mean Response         .741                      .648
                      (.664 to .837)            (.559 to .764)

Overconfidence        +.057                     -.029
                      (-.008 to +.165)          (-.121 to +.046)

Calibration           .0118                     .0055
                      (.0013 to .0339)          (.0009 to .0176)


Figure 1. Calibration curves for the three groups of subjects (expert,
trained, and untrained). The responses of all subjects within each group
were combined. In addition, all responses less than 1.0 were grouped into
categories: .5 to .59, .6 to .69, ... , .9 to .99. The proportion
correct in each category is here plotted against the mean response
in each category.

Use of 1.0. The probabilistic response of 1.0 indicates complete
certainty, and previous research (Fischhoff, Slovic & Lichtenstein, 1977)
has shown that people use it too often when they are, in fact, wrong.
The untrained subjects replicated this finding. They used 1.0 for
22.4% of their responses, of which only 85% were correct (Figure 1).
The trained and expert groups were markedly superior, using 1.0 for
9.6% and 7.7% of their responses, respectively, and getting 96% and
97% of those correct. Two experts never used 1.0; one used it only for
correct alternatives.

Difficulty 

Our  procedure  for  separating  items  into  "easy"  and  "hard"  tests  proved 
quite  successful.  Over  all  35  subjects,  the  percent  correct  for  the  easy 
items  was  81.4;  for  the  hard  items  it  was  57.8.  For  all  subjects  but  one, 
the  percent  correct  on  the  easy  items  exceeded  the  percent  correct  on  the 
hard  items  by  at  least  16  percentage  points. 

The  effect  of  difficulty  on  calibration  was  striking  for  all  subjects. 
On the hard items, subjects were notably overconfident (or at least much
less underconfident; six of the eight experts were still slightly
underconfident on the hard items). On the easy items, even 10 out of the 12
untrained subjects were underconfident. The group calibration curves
for the hard and easy items are shown in Figure 2.

Use of .5. A response of .5 should represent a "pure guess," as likely
to be wrong as right. But for easy items the percentage correct when
our subjects responded .5 was substantially greater than 50% for all three
groups (experts, 59.3%; trained subjects, 58.5%; and untrained subjects,
60.0%). Some experts' use of the response .5 on the easy items was
particularly discouraging: five of the experts responded .5 for 19% of
the easy items and got 65% of them right, well above chance performance.
The other three experts used artificial strategies for their .5 responses.
One did not select an alternative when responding .5 (for data analyses,
the computer alternated between a and b). Another always chose the same
alternative when responding .5, and the third adopted a constant-response
strategy near the beginning of the task. Two of the trained group also made
strategic choices of alternatives when responding .5. The others got
61% of their easy .5's right. It appears that even those well-schooled
in the meaning of ".5" tend to choose the correct answer to easy
questions they think they don't know.

Difficulty,  overconfidence,  and  calibration.  Consistent  with  previous 
research,  the  subjects  in  this  study  tended  to  be  overconfident  with  hard 
items  and  underconfident  with  easy  items.  A  necessary  (but  not 
sufficient)  condition  for  good  calibration  is  that  the  assessor  be  neither 
over-  nor  underconfident.  The  strong  relationship  between  difficulty 
and  overconfidence  suggests  that  there  is  an  "ideal"  difficulty  level 
for which an assessor will be neither over- nor underconfident and thus
will be best calibrated. In Figure 3, overconfidence is plotted against
percentage correct for each subject on the hard and easy tests. The
straight line in each plot connects the group means. These data suggest
that the untrained group might be best calibrated on a test on which they
would get about 78% of the items correct. Lichtenstein and Fischhoff (1977)
estimated this cross-over point at "approximately 80%" (p. 179) for a
different group of untrained subjects. The trained subjects might do
best on a test with 63% correct (close to the 68% they achieved in this
test and the 67% they scored on the last round of their previous training),
and the experts would require an even more difficult test, 58% correct.
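
The cross-over estimates above come from the straight line connecting a group's hard-test and easy-test means. A minimal sketch of that interpolation follows; the function name and the group means are hypothetical (the report does not list the plotted means), and the sketch assumes the relationship is linear between the two points.

```python
def crossover_percent_correct(hard, easy):
    """Percent correct at which the line through two group means crosses
    zero overconfidence.

    `hard` and `easy` are (percent correct, overconfidence) pairs.
    """
    (x_h, y_h), (x_e, y_e) = hard, easy
    if x_h == x_e or y_h == y_e:
        raise ValueError("the two points do not define a unique cross-over")
    slope = (y_e - y_h) / (x_e - x_h)
    return x_h - y_h / slope


# Hypothetical group means: overconfident on the hard test,
# underconfident on the easy test.
hard_mean = (60.0, 0.12)
easy_mean = (80.0, -0.04)
ideal = crossover_percent_correct(hard_mean, easy_mean)
print(f"estimated ideal difficulty: {ideal:.1f}% correct")
```

With these made-up means, the line falls by .16 over 20 percentage points, so it reaches zero overconfidence 15 points above the hard-test mean.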


This  reasoning  suggests  that,  despite  the  apparently  clear-cut 
results  shown  in  Figure  1,  we  cannot  unequivocally  characterize  an 
assessor  or  group  as  "better  calibrated"  than  another  without  taking 
into  account  the  relationship  between  difficulty  and  overconfidence. 
Perhaps  the  trained  group  weren't  better;  they  were  just  lucky  to 
receive  a  test  with  an  overall  difficulty  level  about  the  same  as  the 
difficulty  of  their  training  and  thus  close  to  their  ideal. 

Sensitivity  to  changes  in  difficulty.  Assessors  would  perform 
better  if  they  were  more  sensitive  to  item  difficulty.  Ideally,  their 
response  distributions  would  shift  enough  to  make  changes  in  mean 
response  equal  to  changes  in  proportion  correct.  Completely  insensitive 
assessors  would  maintain  the  same  response  distribution  for  all 
difficulty levels. Letting V stand for mean assessed probability and P
stand for proportion correct, and using subscripts h and e for the hard
and easy tests, the ratio

$$\frac{V_e - V_h}{P_e - P_h}$$

expresses the degree of sensitivity a subject shows to changes in
difficulty. For ideal sensitivity, the index would be equal to 1.0.
Values below 1.0 indicate undersensitivity; values above 1.0 indicate
oversensitivity to changes in difficulty. Table 2 shows the means and
ranges of this sensitivity ratio for the three groups of subjects.

All subjects were undersensitive; however, here (at last!) we find a clear
superiority of experts over the other two groups. Only two of the
untrained subjects and three of the trained subjects were more sensitive
to changes in difficulty than was the least sensitive expert. The
trained subjects were not significantly better than the untrained
group on this measure.³

Figure 3. Overconfidence plotted against percentage correct for
each subject on the hard (circles) and easy (triangles) tests.
The solid symbols are group means.
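
The sensitivity index can be sketched in a few lines, assuming (as the surrounding text implies) that it is the change in mean assessed probability from the hard to the easy test divided by the corresponding change in proportion correct. The function name and the numbers are hypothetical.

```python
def sensitivity_ratio(v_hard, p_hard, v_easy, p_easy):
    """(V_e - V_h) / (P_e - P_h): 1.0 is ideal, 0.0 is complete insensitivity."""
    if p_easy == p_hard:
        raise ValueError("proportion correct must differ between the two tests")
    return (v_easy - v_hard) / (p_easy - p_hard)


# Hypothetical subject: proportion correct rises by .24 from hard to easy,
# but mean response rises by only .12.
ratio = sensitivity_ratio(v_hard=0.65, p_hard=0.58, v_easy=0.77, p_easy=0.82)
print(f"sensitivity ratio = {ratio:.2f}")

# Limiting cases: mean response tracks proportion correct exactly (ideal),
# or does not move at all (completely insensitive).
print(sensitivity_ratio(0.58, 0.58, 0.82, 0.82))
print(sensitivity_ratio(0.70, 0.58, 0.70, 0.82))
```

A subject whose mean response moves only half as far as the proportion correct, as in the first call, would be undersensitive by this index.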

Discussion 

At first glance, our probability experts appeared no better calibrated
than the untrained group; both were inferior to the trained group.
Most of the untrained subjects were overconfident, whereas the experts
were underconfident.

We will never know the calibration of our experts before they
started studying the calibration of others. However, some archival
data seem relevant. The first calibration data we ever collected
used 19 employees of the Oregon Research Institute as subjects. Those
subjects were similar to our present experts in that they were
psychologists studying human judgment and their equally knowledgeable
secretaries and research assistants. But they knew nothing about
calibration. Their responses showed the now-familiar severe overconfidence
(Lichtenstein et al., 1977, Figure 6). The underconfidence of the present
experts seems to represent a change in behavior prompted by research
findings. Apparently, they were determined not to be overconfident and,
in their zeal, they over-corrected, becoming underconfident over a
wide range of difficulty levels.

If this interpretation is correct, these results are moderately
encouraging. People can learn from the experience of others. Readers
of this article, having learned both of the general tendency to be
overconfident and of the possibility of overcorrection, should therefore
be able to produce well-calibrated responses.⁴


Table 2

Sensitivity Ratio

                          Group
          Untrained       Trained         Experts

Mean      .37             .30             .60

Range     .04 to .68      .02 to .65      .45 to .73

A more detailed examination of the data indicated that the labeling
of one group of assessors as "better calibrated" than another is rendered
moot by the systematic relationship between difficulty and overconfidence.
This relationship is apparently mediated by an insensitivity to changes
in difficulty. The experts were superior to both other groups in being
more sensitive to changes in difficulty of the items, but the hard/easy
effect was still substantial in the expert group.

The hard/easy effect can be viewed as a regression effect: if the
correlation between the ease of item (defined by some objective criterion)
and proportion correct is greater than the correlation between ease of
item and mean response, as shown in Figure 4, then for easy items, one
would expect to observe underconfidence ($P_e > V_e$), whereas for the hard
items, one would expect overconfidence ($V_h > P_h$).

Some introspection suggests how hard it would be to be fully
sensitive to test difficulty. Whether defined in terms of the proportion
of people getting an item correct (as in Lichtenstein & Fischhoff, 1977)
or by reference to some measure related to an item's content (as done
here), item difficulty does not appear to be a piece of information
above and beyond one's general feelings of uncertainty about which
answer is correct. Intuitively, easy items are just those to which one
is inclined to state a high probability, while hard items are those
about which one is quite unsure. Sensitivity to difficulty might
require a counter-intuitive, two-stage process, involving the assessment
of both personal uncertainty and item difficulty ("I'm pretty sure I
know the answer, but this seems like a hard item, so I'd better lower
my confidence") before arriving at a probability assessment. Indeed,
although both the present authors were above average (even among the
experts) in sensitivity to difficulty, we did not consciously use such
a two-stage process, and do not know how we did as well as we did.

Footnotes

We  acknowledge  with  thanks  our  discussions  with  Paul  Slovic  and 
Daniel  Kahneman.  This  research  was  supported  in  part  by  the  Advanced 
Research  Projects  Agency  of  the  Department  of  Defense,  monitored  by  the 
Office  of  Naval  Research  under  Contract  N00014-79-C-0029  (ARPA  Order 
No. 3668) to Perceptronics, Inc., and in part from the Office of
Naval  Research  under  Contract  N00014-80-C-0150  to  Perceptronics,  Inc. 
Requests  for  reprints  may  be  addressed  to  Sarah  Lichtenstein,  Decision 
Research,  1201  Oak  Street,  Eugene,  Oregon  97401. 

1.  Our  deep  thanks  to  Barbara  Combs,  Dennis  Fryback,  Barbara 
Goodman, Don MacGregor, Gordon Pitz, and David Seaver for serving as
our  expert  subjects. 

2. Because this measure is artifactually increased by the
infrequent use of two-digit probabilities (e.g., .95), the data were
grouped (.5-.59, .6-.69, ... , .9-.99, 1.0) for calculating the
calibration index.

3. One-way ANOVA: F = 10.4; p = .0003. Trained vs. untrained,
t = 1.20, p > .2; trained vs. expert, t = 3.57, p = .001; untrained vs.
expert, t = 4.44, p = .0001.

4. We are willing to analyze the results for the first 40 readers
who write us to accept this challenge. You will have the advantage of
knowing the approximate difficulty of the test.